10 research outputs found

RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization

Author: Arandjelović Relja
Chen Hui
Chiu Han-Pang
Cordts Marius
Cummins Mark J
Faghri Fartash
Gong Yunchao
Hu Sixing
Huang Feiran
Hubert Tsai Yao-Hung
Levinson Jesse
Mahmood Faisal
Mao Junhua
Mithun Niluthpol Chowdhury
Pronobis Andrzej
Razavian Sharif
Rottmann Axel
Schönberger Johannes L
Seymour Zachary
Toft Carl
Wang Tan
Wu Jianxin
Wu Yiling
Zadeh Amir
Zhou Bolei
Zolanvari SM
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 11/09/2020

We study an important yet largely unexplored problem: large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not scale to large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs of RGB and aerial LIDAR depth images, covering a 143 km^2 area. We propose a novel joint-embedding-based method that effectively combines the appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a median rank of 5 when matching across a large test set of 50K location pairs collected from a 14 km^2 area. This represents a significant advancement over prior works in performance and scale. We conclude with qualitative results to highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization.
Comment: ACM Multimedia 202

arXiv.org e-Print Archive

Crossref

Cross-View Visual Geo-Localization for Outdoor Augmented Reality

Author: Chiu Han-Pang
Kumar Rakesh
Minhas Kshitij
Mithun Niluthpol Chowdhury
Oskiper Taragay
Samarasekera Supun
Sizintsev Mikhail
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/03/2023

Precise estimation of global orientation and location is critical to ensure a compelling outdoor Augmented Reality (AR) experience. We address the problem of geo-pose estimation by cross-view matching of query ground images to a geo-referenced aerial satellite image database. Recently, neural network-based methods have shown state-of-the-art performance in cross-view matching. However, most prior works focus only on location estimation and ignore orientation, which does not meet the requirements of outdoor AR applications. We propose a new transformer neural network-based model and a modified triplet ranking loss for joint location and orientation estimation. Experiments on several benchmark cross-view geo-localization datasets show that our model achieves state-of-the-art performance. Furthermore, we present an approach that extends single-image query-based geo-localization by utilizing temporal information from a navigation pipeline for robust continuous geo-localization. Experiments on several large-scale real-world video sequences demonstrate that our approach enables high-precision and stable AR insertion.
Comment: IEEE VR 202

arXiv.org e-Print Archive

Diachronic cross-modal embeddings

Author: Andrew Galen
Bamler Robert
Bengio Yoshua
David
He Kaiming
Herbrich Ralf
Huang Xin
Kim Gunhee
Lau Jey Han
Mikolov Tomas
Mithun Niluthpol Chowdhury
Uricchio Tiberio
Wang L.
Yao T.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 30/09/2019

This work has been partially funded by the CMU Portugal research project GoLocal Ref. CMUP-ERI/TIC/0046/2014, by the H2020 ICT project COGNITUS with the grant agreement no 687605 and by the FCT project NOVA LINCS Ref. UID/CEC/04516/2019. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.

Understanding the semantic shifts of multimodal information is only possible with models that capture cross-modal interactions over time. Under this paradigm, a new embedding is needed that structures visual-textual interactions according to the temporal dimension, thus preserving the data's original temporal organisation. This paper introduces a novel diachronic cross-modal embedding (DCM), where cross-modal correlations are represented in embedding space throughout the temporal dimension, preserving semantic similarity at each instant t. To achieve this, we trained a neural cross-modal architecture under a novel ranking loss strategy that, for each multimodal instance, enforces the temporal alignment of neighbour instances through subspace structuring constraints based on a temporal alignment window. Experimental results show that our DCM embedding successfully organises instances over time. Quantitative experiments confirm that DCM is able to preserve semantic cross-modal correlations at each instant t while also providing better alignment capabilities. Qualitative experiments unveil new ways to browse multimodal content and hint that multimodal understanding tasks can benefit from this new embedding.

arXiv.org e-Print Archive

Crossref

Repositório da Universidade Nova de Lisboa

Learning Robust Visual-Semantic Retrieval Models with Limited Supervision

Author: Mithun Niluthpol Chowdhury
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

In recent years, tremendous success has been achieved in many computer vision tasks using deep learning models trained on large hand-labeled image datasets. In many applications, this may be impractical or infeasible, either because large datasets are unavailable or because of the time and resources needed for labeling. In this respect, an increasingly important problem in computer vision, multimedia and machine learning is how to learn useful models for tasks where labeled data is sparse. In this thesis, we focus on learning comprehensive joint representations for different cross-modal visual-textual retrieval tasks by leveraging weak supervision, which is noisier and/or less precise but cheaper and/or more efficient to collect. Cross-modal visual-textual retrieval has gained considerable momentum in recent years due to the promise of deep neural network models in learning robust aligned representations across modalities. However, the difficulty of collecting aligned pairs of visual data and natural language descriptions, and the limited availability of such pairs in existing datasets, make it extremely difficult to train effective models that generalize well to uncontrolled scenarios, since these models rely heavily on large volumes of training data that closely mimic what is expected in the test cases. In this regard, we first present our work on developing a multi-faceted joint embedding framework-based video-to-text retrieval system that utilizes multi-modal cues (e.g., objects, action, place, sound) from videos to reduce the effect of limited data. Then, we describe our approach to training text-to-video moment retrieval systems leveraging only video-level text descriptions without any temporal boundary annotations. Next, we present our work on learning powerful joint representations of images and text from small fully annotated datasets with supervision from weakly-annotated web images. Extensive experimentation on different benchmark datasets demonstrates that our approaches show substantially better performance compared to baselines and state-of-the-art alternative approaches.

Ezid

eScholarship - University of California

Diversity-Aware Multi-Video Summarization

Author: Amit K. Roy-Chowdhury
Niluthpol Chowdhury Mithun
Rameswar Panda
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Graph-based Multimodal Ranking Models for Multimodal Summarization

Author: Barnard Kobus
Celikyilmaz Asli
Chen Jingqiang
Choi Jinsoo
Diederik
Elhamifar E.
Faghri Fartash
Frome Andrea
He Xiaofei
Hubert Tsai Yao-Hung
Karypis George
Li Haoran
Lin Chin-Yew
Lin Tsung-Yi
Mithun Niluthpol Chowdhury
Pascanu Razvan
Paulus Romain
Plummer Bryan A.
Seneta Eugene
Sharma Vasu
Simon Ian
Simonyan Karen
Wan Xiaojun
Wang Liwei
Wang William Yang
Xiong Bo
Zhu Junnan
Zhu Yukun
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref